Set the below parameters for the downstream analysis
#Give names to saved object and name of results repository
name_result_folder<-'TS_results_PBMC_example/'
obj_name<-'timeSeries_obj_PBMC_example.Rdata'
#Path to count data and sample data respectively
#Not needed if using example data
my_path_data<-'data/PBMC/raw_counts_TS'
my_path_sample_dta<-'data/PBMC/sample_file.csv'
#Set-up time series object parameters
diff_exp_type<-'DESeq2' #package used for DE – can also be 'limma'
p_val_filter_type<-'padj' #Either padj or pvalue, used to filter for significance
p_thresh<-0.05 #pvalue or padj value threshold for significance
l2fc_thresh<-1 #log(2)foldChange threshold for significance
name_control<-'LPS' #Name of experiment as seen in the sample file
name_experiment<-'IgM' #Name of control as seen in the sample file
graphic_vector<-c("#e31a1c","#1f78b4") #Pre-set colors for the groups
#Declare organism and load library, install if necessary
org_sem_sim='org.Hs.eg.db'
if(system.file(package=org_sem_sim) ==''){#Package not installed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(org_sem_sim)
}
library('org.Hs.eg.db')
## Loading required package: AnnotationDbi
##
#Define specimen and ontology parameters
my_ont_gpro='GO:BP'
my_ont_sem_sim='BP'
#mmusculus
#hsapiens
#celegans
my_org_gpro='hsapiens' #Set the species for the gprofiler analysis
# The seed serves to create reproducible results with PART.
# A seed will ensure that the random components of PART clustering will be the same
# as long as the same seed is used. For more information on seeds, please consult
# this link: https://www.r-bloggers.com/2018/07/%F0%9F%8C%B1-setting-a-seed-in-r-when-using-parallel-simulation/
PART_seed=123456
PART_l2fc<-1 #log(2)foldChange threshold for PART clustering
PART_min_clust<-50 #Minimum cluster size for PART
PART_recursion<-100 #Number of recursions, default is 100, using 10 for example
log_tp_traj<-FALSE #Defines if timepoints should be log transformed for illustration purposes
# Allows for all temporal combinations to be done instead
# of just sequential comparison. ex: do TP2vsTP1, TP3vsTP2, AND TP3vsTP1. In a normal instance
# only the first two comparison of the example would be run.
do_all_temporal_comparisons=FALSE
#Used to highlight specific genes regardless of differential gene expression significance
genes_of_interest <- c('AICDA','APOBEC3H','APOBEC3F','APOBEC3D','APOBEC3C','APOBEC3G','APOBEC3B','APOBEC3A','SMUG1','UNG','EGFR')
#The ancestors that will be queried, the ontology must be specified (BP,MF,or CC)
#Set to an empty vector c() if not required by the analysis
target_ancestors<-c('GO:0002253','GO:0019882','GO:0002404','GO:0002339','GO:0042386',
'GO:0035172','GO:0002252','GO:0006955','GO:0002520','GO:0090713',
'GO:0045321','GO:0001776','GO:0050900','GO:0031294','GO:0002262',
'GO:0002683','GO:0002684','GO:0002440','GO:0002682','GO:0002200',
'GO:0045058','GO:0002507')
ancestor_ontology<-'BP'
#Some extra set-up
name_save_obj<-paste0(name_result_folder,obj_name)#The object will be saved in result folder
#Create main directory for results
dir.create(name_result_folder)
my_group_names<-c(name_experiment,name_control)
names(graphic_vector)<-c(name_experiment,name_control)
This code chunk first checks if a timeseries object has been saved,
if that is the case, the object is loaded instead of being created
again. To reset this analysis, the saved timeseries object should be
deleted or moved from the repository.
If no timeseries object was found, the code chunk creates the timeseries
object using the parameters defined above. The raw matrix is also
created using the second function
if(obj_name %in% list.files(name_result_folder)==FALSE){
TS_object <- new('TimeSeries_Object',
group_names=my_group_names,group_colors=graphic_vector,DE_method=diff_exp_type,
DE_p_filter=p_val_filter_type,DE_p_thresh=p_thresh,DE_l2fc_thresh=l2fc_thresh,
PART_l2fc_thresh=PART_l2fc,sem_sim_org=org_sem_sim,Gpro_org=my_org_gpro)
# TS_object <- TS_load_example_data(TS_object,'PBMC')
TS_object <- add_experiment_data(TS_object,sample_dta_path=my_path_sample_dta,count_dta_path=my_path_data)
TS_object <- add_semantic_similarity_data(TS_object,my_ont_sem_sim)
}else{
load(name_save_obj)
}
Loads the necessary functions to perform differential gene expression analysis If the tool used is DESeq2(Love, Huber, and Anders 2014), the data needs to be normalized using it’s method. If the tool used is limma(Smyth 2005), the normalized matrix should have been inputed, and thus no normalization is needed.
This code chunk performs both the conditional and temporal differential gene expression and saves the results within the Time Series object.
#Perform normalization if the DESeq2 tool is being used and if normalized matrix doesn't exist
if (slot(TS_object,'DE_method')=='DESeq2' & 'norm' %in% names(assays(slot(TS_object,'exp_data')))!=TRUE){
TS_object <- normalize_timeSeries_with_deseq2(time_object=TS_object)
}
#Perform conditional differential gene expression analysis
TS_object<-conditional_DE_wrapper(TS_object)
#Perform temporal differential gene expression analysis
TS_object<-temporal_DE_wrapper(TS_object,do_all_combinations=do_all_temporal_comparisons)
#save the timeseries object
save(TS_object,file=name_save_obj)
The code chunk below prepares and initiates PART clustering(Nilsen et al. 2013). It first retrieves the number of significant genes for PART clustering in the ‘signi_genes’ variable This chunk can be quite lengthy depending on the number of genes included for clustering as well as the number of recursions set for the clustering
#Extract genes for PART clustering based on defined log(2)foldChange threshold
signi_genes<-select_genes_with_l2fc(TS_object)
sample_data<-exp_sample_data(TS_object)
#Use all samples, but implement a custom order. In this case it is reversed
samps_2<-sample_data$sample[sample_data$group==TS_object@group_names[2]]
samps_1<-sample_data$sample[sample_data$group==TS_object@group_names[1]]
#Create the matrix that will be used for PART clustering
TS_object<-prep_counts_for_PART(object=TS_object,target_genes=signi_genes,
scale=TRUE,target_samples=c(samps_2,samps_1))
#Sets a seed for reproducibility
if (is.null(PART_seed)==FALSE){
set.seed(as.character(PART_seed))
}
TS_object<-compute_PART(TS_object,part_recursion=PART_recursion,part_min_clust=PART_min_clust,
custom_seed=PART_seed,dist_param="euclidean", hclust_param="average")
#Save the TimeSeries object to the directory to prevent loss of data in the event
#of a downstream error
save(TS_object,file=name_save_obj)
The code chunk below runs a gprofiler analysis(Kolberg et al. 2020; Raudvere et al. 2019). IMPORTANT: This function requires that there be a stable internet connection lack of connection or intermittent drops in the connection will result in an error and the termination (stop) of the code chunk
If an error has occured, this code chunk can be re-run separately from the above chunks by uncommenting (removing the ‘#’ in front) the “load(‘timeseries_obj_res.Rdata’)” line.
This will load the results saved after the PART clustering code chunk
This code chunk will overwrite the saved object if it is completed. The overwritten object will then contain the gprofiler analysis results and can be used to generate plots with the downstream code chunks.
# load('timeseries_obj_res.Rdata')
TS_object<-run_gprofiler_PART_clusters(TS_object,vignette_run = TRUE) #Run the gprofiler analysis
#Save the results in a Rdata object
save(TS_object,file=name_save_obj)
Most plots are created in SVG format for ease of editing with SVG editing software such as InkScape (open source software). Some plots are created in html format as they are interactive plots. These require a web browser to open them
To convert SVG files to PNG/JPG/PDF, this website is available: https://svgtopng.com/
HTML files can be opened and then saved as PNG (or other) using the
camera icon in the top right of each respective interactive plot
Name of groups beings compared: IgM vs LPS Number of genes analyzed: 26485
Differential expression parameters used: Method: DESeq2 P statistic filter used: padj with a 0.05 threshold Log(2)FoldChange significance threshold used: 1
PART parameters used: PART log(2)FoldChange significance threshold: 1 number of recursions: 100 minimum cluster size: 50 distance parameter: euclidean hclust parameter: average custom seed used: 123456 PART computation time: 5492.877 sec elapsed
PCA of sample data:
DEGs found per analysis:
| experiment.name | number.of.DEGs |
|---|---|
| IgM_vs_LPS_TP_1 (conditional) | 1763 |
| IgM_vs_LPS_TP_3 (conditional) | 1369 |
| IgM_vs_LPS_TP_9 (conditional) | 1568 |
| TP_3_vs_TP_1 (temporal) | 957 |
| TP_9_vs_TP_3 (temporal) | 86 |
| total unique genes | 3884 |
Each experiment listed in the above table has separate results which
can be viewed in the ‘TS_results’ folder. In the main results folder,
the differential gene expression results are located in the
‘DE_results_conditional’ and ‘DE_results_temporal’ plot.
Each individual experiment creates the following:
DE_raw_data.csv: A csv file containing all the genes
and their associated differential expression results, regardless of
significance.
DE_sig_data: A csv file containing only the significant
differential expression results.
MA_plot.png: A MA plot, these are often used to
evaluate the quality of the normalization.
volcano_plot.png: A volcano plot which splits
significantly up-regulated and down-regulated genes as well as
non-significant genes.
PCA_plot: A PCA plot using only the samples implicated
in the differential gene expression analysis.
The pipeline also creates two large heatmaps which summarize
conditional and temporal differential expression results. The heatmaps
were created using the ComplexHeatmaps package(Gu, Eils, and Schlesner 2016).
These heatmaps illustrate each experiment as different columns
(distinguished by color) while the rows distinguish the groups. Plotted
in the heatmaps is the log transformed counts, while the log transformed
log(2)fold change is seen in a histogram at the bottom of the heatmap.
The number of genes in each experiment (colored columns) is indicated in
the legend at the bottom of the heatmap.
In the case of the temporal heatmap, the rows are labelled ‘experiment’
and ‘control’. The ‘experiment’ represents the time points being
compared while the control represents the timepoint which the
‘experiment’ is compared against. For example, TP2_vs_TP1 would have the
‘experiment’ be TP2 and the control be TP1, while TP3_vs_TP2 would have
TP3 as the ‘experiment’ and TP2 as the ‘control’. Note that TP=time
point.
The conditional heatmap